Species distribution models (SDMs, also called habitat suitability models or ecological niche models, depending on the goal) identify areas of likely species occurrence by learning a relationship between known occurrences and environmental variables (covariates). The ‘learning’ component of species-environment relationships can be accomplished by common machine learning algorithms, such as random forest or Maximum Entropy (MaxEnt). For this short course, the inner workings of these algorithms are beyond the scope of our available time. There are many papers describing SDMs, but if you are new to the field, the papers by Elith & Leathwick (2009) and Valavi et al. (2021) are a useful place to start.
SDMs are used for a range of purposes, including understanding drivers of species distributions, conservation planning and risk forecasting. This tutorial is likely to be challenging for a beginner in spatial analysis, but is designed to try to balance thoroughness with simplicity.
Our overall goal is to produce a prediction map of the most suitable habitat for a species of interest and to better understand the environmental drivers behind its current distribution. Along the way, we will carefully process our data to make sure we have the best inputs and then provide a model evaluation procedure to determine how robust our findings are.
Figure 1. The expected workflow and end product of session 2.
library(tidyverse)
library(rnaturalearth)
library(rnaturalearthdata)
library(flexsdm)
library(corrplot)
library(sf)
library(terra)
library(here)
library(SDMtune)
library(virtualspecies)
library(exactextractr)
Step 5.
In this session, we will use both raster and vector data. Unfortunately, there is no consensus on which libraries to use for these data types, and even where there is, older packages may not have been updated to the newer libraries yet. Because of this, we will be jumping between packages for handling rasters (raster and terra libraries) and vectors (sf and terra). We start by loading in our vector data (points and country boundary) using terra::vect():
# Name your species
spp <- c("Dracaena reflexa")
spp_code <- gsub(' ', '_', spp)
#### Load in data ----
# Read species data
all_pts <- vect(here(paste0('output/species_localities/',spp_code, '_all_points.shp')))
# using the rnaturalearth package, download the country border and return it as an sf object
mad_sf <- ne_countries(country = 'Madagascar', scale = 'medium', returnclass = 'sf')
mad <- vect(mad_sf)
It is really important to only predict our habitat suitability to a
region where the species could feasibly occur. In this practical, we are
looking at species within Madagascar, which is a distinct geographical
and biogeographical region. This makes our life a lot easier. However,
in some cases, you may be working with species that don’t comfortably
fit into a “modeling box”. In this case, we may want to define a
calibration area (or area of interest) where our species could feasibly
occur. We do not want to select an area too big (predicting outside of
our species possible range) or an area too small (not predicting across
its full range or leaving out useful locality data). Take a look at the help page for flexsdm::calib_area() to produce a useful calibration area.
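As a rough sketch of what this could look like (not run in this practical; the 50 km buffer width is an arbitrary example value that you would choose based on your species' dispersal biology), a buffered calibration area might be built like this:

```r
# Sketch only: build a calibration area as a 50 km buffer around all
# occurrence points. The buffer width (50,000 m) is an example value.
ca <- calib_area(
  data = as.data.frame(all_pts, geom = 'XY'),
  x = 'x',
  y = 'y',
  method = c('buffer', width = 50000),
  crs = crs(all_pts)
)
plot(ca)
plot(all_pts, add = TRUE)
```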
Load in our environmental variables (WorldClim), changing the names, then cropping and masking the dataset to Madagascar’s boundaries. Lastly, we re-scale the temperature variables. For ease of storage, the values are stored as whole numbers (i.e. no decimal places), having been multiplied by 10 (or by 100 for isothermality and temperature seasonality). So we need to divide them by 10 (or 100) to get the real values.
# load worldclim data
worldclim <- rast(here("data/rast/afr_worldclim.tif"))
names(worldclim) <- c("mean_ann_t","mean_diurnal_t_range","isothermality", "t_seas", 'max_t_warm_m','min_t_cold_m',"t_ann_range",'mean_t_wet_q','mean_t_dry_q','mean_t_warm_q','mean_t_cold_q','ann_p', 'p_wet_m','p_dry_m','p_seas','p_wet_q','p_dry_q','p_warm_q','p_cold_q')
wc_mad <- worldclim %>%
crop(., mad) %>%
mask(., mad)
# Re-scale temperature values
wc_mad[[c(1:2,5:11)]] <- wc_mad[[c(1:2,5:11)]]/10
wc_mad[[3:4]] <- wc_mad[[3:4]]/100
Let’s plot all of our data together now (1 raster layer, Madagascar boundary and all points):
#### Visualise raw data ----
plot(wc_mad$mean_ann_t)
plot(mad, add = T)
plot(all_pts, add=T)
Check and fix the projections.
#### Check projections ----
crs(wc_mad) == crs(mad)
## [1] FALSE
crs(wc_mad) == crs(all_pts)
## [1] TRUE
crs(mad) == crs(all_pts)
## [1] FALSE
mad <- project(mad, wc_mad)
crs(wc_mad) == crs(mad)
## [1] TRUE
We now need to process our environmental variables a bit more thoroughly. To do this, we use a simple Pearson correlation and aim to identify a reasonable threshold (usually 70%, or 0.7). This threshold is a judgement call to decide whether different variables/layers may be providing redundant (i.e. very similar) information to one another. For example, precipitation in the driest quarter and precipitation in the driest month tend to provide the same information. We want to do our best to remove this redundancy. What’s important to keep in mind is that even though we may keep only one of a set of correlated layers in our model and remove the others, if the layer we kept is important for predicting our species distribution, it is likely the removed layers will be important too!
First, we make a correlation plot, to identify our correlations:
# Using Pearson correlation
cov_colin <- correct_colinvar(wc_mad, method = c('pearson', th = "0.7"))
# Take a look at the correlations using corrplot
corrplot(cov_colin$cor_table, type = 'lower', diag = FALSE, order = 'hclust', tl.cex = 0.6, tl.col = 'black')
Next, we use a recursive approach using hierarchical classification of groups of similar variables. Variables will be placed in groups with other variables most similar to them. In some cases, this may mean that variables with marginal correlation (e.g. just over 0.7), may be placed in with variables with greater correlations (e.g. >0.8). For this reason, we will keep running the function until we no longer have any correlations in our dataset.
# remove collinearity
set.seed(42)
non_colin <- removeCollinearity(raster::stack(wc_mad), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)
non_colin
## [1] "mean_ann_t" "mean_diurnal_t_range" "isothermality"
## [4] "t_seas" "mean_t_wet_q" "p_warm_q"
## [7] "p_wet_q" "p_seas"
These variables are the ones that have no other clear correlations with them, based on the first round of clustering. Let’s run it again to see if it found everything:
wc_mad_sel <- wc_mad[[non_colin]]
set.seed(42)
non_colin_check <- removeCollinearity(raster::stack(wc_mad_sel), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)
There are still some variables showing correlations, so run this once more.
wc_mad_sel <- wc_mad_sel[[non_colin_check]]
set.seed(42)
non_colin_check_2 <- removeCollinearity(raster::stack(wc_mad_sel), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)
## - No multicollinearity detected in your data at threshold 0.7
We now have a set of 7 variables that we will use for our model.
Step 6.
The next steps are important to complete our final dataframe for the modeling process. Essentially, the dataframe has the same structure as what you typically need for most models. For each locality (both presences and pseudo-absences), we need to provide whether the species is present/absent and all of its associated environmental values. We then run our model in the following format:
\[ response \sim predictor/s \]
In our case, the equation will take the following broad structure:
\[
presence \sim worldclim1 + worldclim2 + worldclim3 + ...
\] There are many ways to build this dataset. For example, we could go the route of using the terra::extract() function, which will produce our desired dataframe structure in a generic format that most models will accept. In our case, we will use a function that is designed specifically for the SDMtune package: prepareSWD(). This function runs the extraction under the hood, so we need to provide it with our presence & absence datasets and our environmental predictors:
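For comparison, the generic terra::extract() route might look roughly like this (a sketch; it assumes all_pts carries the pr_ab presence/absence column used elsewhere in this tutorial):

```r
# Sketch: extract predictor values at each point, then bind on the
# presence/absence column to get a generic model-ready data frame
env_vals <- terra::extract(wc_mad_sel, all_pts, ID = FALSE)
model_df <- cbind(pr_ab = all_pts$pr_ab, env_vals)
head(model_df)
```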
# Prepare a SWD (Sample with Data), which is a class of data specifically used in the SDMtune package
all_pts_df <- as.data.frame(all_pts, geom = 'XY')
SWDdata <- prepareSWD(
species = 'Aristida rufescens',
p = all_pts_df %>% filter(pr_ab == 1) %>% dplyr::select(x, y),
a = all_pts_df %>% filter(pr_ab == 0) %>% dplyr::select(x, y),
env = wc_mad_sel
)
## Extracting predictor information for presence locations...
## Info: 1 presence location is NA for some environmental variables, it is discarded!
## Extracting predictor information for absence/background locations...
# For some reason ID is included, so we need to remove this...
SWDdata@data <- SWDdata@data[-1]
set.seed(42)
sample_n(cbind(SWDdata@pa, SWDdata@data), 10)
## SWDdata@pa mean_ann_t mean_diurnal_t_range isothermality t_seas p_warm_q
## 1 1 21.1 14.8 0.67 24.81 505
## 2 0 17.0 11.7 0.63 25.92 799
## 3 1 22.6 9.4 0.59 23.76 691
## 4 1 20.3 11.5 0.68 22.37 930
## 5 0 21.5 9.4 0.62 23.38 1134
## 6 1 21.4 9.6 0.74 13.64 822
## 7 1 26.4 10.7 0.69 11.97 806
## 8 0 21.2 13.0 0.64 24.17 490
## 9 1 24.4 14.4 0.65 28.43 333
## 10 0 24.9 14.5 0.65 25.99 527
## p_wet_q p_seas
## 1 505 107
## 2 799 82
## 3 760 40
## 4 930 89
## 5 1134 55
## 6 822 97
## 7 1128 106
## 8 515 107
## 9 333 100
## 10 527 110
Unlike most frequentist statistics, we will not have an output with p-values to suggest whether our model performed well or not. To properly evaluate our model, we need to test our model predictions against data that it has not seen. In most cases, we do not have an alternative dataset to do this with. So instead, we will take a small portion of our data (called the test data) and leave it out of our model development. The data we keep to run the model with is called the training data. Typically, the split between training and test data is around 70% vs. 30%. In our case, we do not have a huge amount of data, so we want to keep as much as we can to train or teach the model enough about our data to make decent predictions. So we will split our data into 80% training and 20% testing data. This is the learning process in machine learning.
We also use what’s called a seed here. The train/test split is a random process. However, if we want the results to look the same on any machine that runs this random process, we can provide it with a set collection of random numbers. The seed can be any number, but as long as it’s the same number on everyone’s machines, it will produce the same output.
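To see the effect of a seed for yourself, try a quick experiment in your console:

```r
sample(1:10, 3)   # different every run
set.seed(42)
sample(1:10, 3)   # fixed by the seed
set.seed(42)
sample(1:10, 3)   # identical to the previous draw
```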
# Split locations in training (80%) and testing (20%) datasets
split_data <- trainValTest(SWDdata, test = 0.2, seed = 42)
train <- split_data[[1]]
test <- split_data[[2]]
Now that our data is extracted and we have split it into our training vs. testing data, we are ready to run almost any model type. In this case, we will use random forest, a commonly used machine learning algorithm.
Step 7.
It is very simple to run our model using the SDMtune package. We simply call train() and specify the method/s we want to use (this is the model) and our training data. Here we use method = c('RF') to use a random forest model, but there are many other options (run ?SDMtune::train() in your console to see the other options available in the SDMtune package). Read up here on the advantages of random forests. There are several parameters we could change for the random forest model, but we will use the default settings. The output of our model is a % estimate of the likelihood of each pixel in our calibration area being a presence. This percentage is calculated by counting the number of decision trees that decided a pixel was either a presence or an absence. For example, if 73 trees out of 100 consider the pixel a presence, then the output will suggest a 73% chance of presence.
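As a toy illustration of this vote counting (hypothetical votes, not our actual model):

```r
# 100 hypothetical tree votes for a single pixel (1 = presence, 0 = absence)
votes <- c(rep(1, 73), rep(0, 27))
mean(votes)  # proportion of trees voting 'presence': 0.73
```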
set.seed(42)
rf_model <- train(method = c('RF'), data = train)
rf_model
## Object of class SDMmodel
## Method: RF
##
## Species: Aristida rufescens
## Presence locations: 150
## Absence locations: 151
##
## Model configurations:
## --------------------
## mtry: 2
## ntree: 500
## nodesize: 1
##
## Variables:
## ---------
## Continuous: mean_ann_t mean_diurnal_t_range isothermality t_seas p_warm_q p_wet_q p_seas
## Categorical: NA
Here is the predicted output for all of the training data we used to train the model, shown as a probability, with a majority-rule (0.5) threshold, and with two alternative thresholds (0.4 and 0.6):
cbind(train@pa, predict(rf_model, data = train)) %>%
as.data.frame() %>%
rename(true_class = V1, predicted_class = V2) %>%
mutate(threshold_0.4 = ifelse(predicted_class >= 0.4, 1, 0),
threshold_0.5 = ifelse(predicted_class >= 0.5, 1, 0),
threshold_0.6 = ifelse(predicted_class >= 0.6, 1, 0)) %>%
DT::datatable()
Our model is now trained. Before we look at any exciting outputs, it is crucial to evaluate the performance of our model using the test data we kept behind earlier. There are dozens of available metrics to evaluate model performance. All of these are based on different interpretations of a confusion matrix, which is simpler than it sounds. A confusion matrix counts how many times our testing data is accurately classified as present (1) or absent (0). These give us our true positives and true negatives (i.e. if our model predicts an absence and our testing data confirms this, it is a true negative).
However, in some cases our model might predict an absence where our testing data says there is a presence. This would be a false negative. The last category is when our model predicts a presence but our testing data says that site is an absence; this would be a false positive. These 4 categories are then used to make a simple table, as shown below:
Classic confusion matrix layout
Read up more on confusion matrices here.
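In base R, a confusion matrix is simply a cross-tabulation of observed against predicted classes. With two small hypothetical vectors of 0/1 labels:

```r
# Hypothetical observed and predicted classes for six sites
observed  <- c(1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 1, 0)
table(Predicted = predicted, Observed = observed)
```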
Let’s say we set a threshold where any prediction over 50% is a presence. Let’s see what our confusion matrix is for our test data:
cm <- confMatrix(rf_model, test = test, th = 0.5)
cm_format <- as.data.frame(rbind(c('Positive',cm[,2],cm[,3]), c('Negative',cm[,4],cm[,5])))
names(cm_format) <- c('Predicted classes','Positive', 'Negative')
# rownames(cm_format) <- c('Positive', 'Negative')
sketch = htmltools::withTags(table(
class = 'display',
thead(
tr(
th(rowspan = 2, 'Predicted classes'),
th(colspan = 2, 'Actual classes'),
),
tr(
lapply(rep(c('Positive', 'Negative'), 1), th)
)
)
))
DT::datatable(cm_format, container = sketch, rownames = F)
Our results here suggest that there are 33 true positives and 26 true negatives. However, we have incorrectly classified a few points: there are 12 false positives and 5 false negatives.
The first, very simple evaluation metric we can use is accuracy. This simply calculates the number of observations correctly classified (true) over all of the observations:
\[ Accuracy = ((True Positive + True Negative) / Total Observations) * 100 \]
In our case our formula is:
\[ Accuracy = ((33 + 26) / 76) * 100 \] \[ Accuracy = 77.63\% \]
Accuracy is just one very simple measure of how well a model performs, and many would argue it is not a reliable one. Values above 70% may indicate a decent to very good model. Anything below 50% performs worse than chance and is a poorly performing model.
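You can verify this arithmetic directly in R using the counts from our confusion matrix:

```r
# Accuracy from the confusion matrix counts reported above
tp <- 33; tn <- 26; fp <- 12; fn <- 5
accuracy <- (tp + tn) / (tp + tn + fp + fn) * 100
round(accuracy, 2)  # 77.63
```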
There are a few other metrics we could use, such as the True Skill Statistic (TSS) or the Area Under Curve (AUC).
The formula for TSS is:
\[ TSS = True Positive Rate + True Negative Rate - 1 \] where: \[ True Positive Rate = True Positives / (True Positives + False Negatives) \] and:
\[ True Negative Rate = True Negatives / (True Negatives + False Positives) \]
So, to calculate our TSS we can substitute in the values from our confusion matrix. We first calculate our True Positive Rate:
\[ True Positive Rate = 33 / (33 + 5) = 0.87 \]
Followed by our True Negative Rate:
\[ True Negative Rate = 26 / (26 + 12) = 0.68 \]
And then combine them together in our TSS formula:
\[ TSS = True Positive Rate + True Negative Rate - 1 = 0.87 + 0.68 - 1 = 0.55 \]
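The same calculation, step by step in R:

```r
# TSS from the confusion matrix counts reported above
tp <- 33; tn <- 26; fp <- 12; fn <- 5
tpr <- tp / (tp + fn)    # 33/38, approx. 0.87
tnr <- tn / (tn + fp)    # 26/38, approx. 0.68
round(tpr + tnr - 1, 2)  # 0.55
```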
AUC is slightly different. It is independent of the classification threshold value, because it is an aggregated value obtained by testing the true positive rate against the false positive rate at many different threshold values (varying from 0, where all values are classified as positives, to 1, where all values are classified as negatives). It is calculated as the area under the receiver operator curve (ROC), as shown below:
An example of a Receiver Operator Curve. AUC is the value of the Area Under Curve of the ROC.
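To demystify the aggregation, here is a rough base-R sketch of how an AUC could be assembled by hand (SDMtune computes this for us; the label and probability vectors below are made up for illustration):

```r
# Hypothetical observed labels and predicted probabilities
obs  <- c(1, 1, 1, 0, 0, 1, 0, 0)
prob <- c(0.9, 0.8, 0.6, 0.55, 0.5, 0.4, 0.3, 0.1)
# Sweep thresholds from high to low, computing FPR and TPR at each
ths <- sort(unique(c(0, prob, 1)), decreasing = TRUE)
roc <- t(sapply(ths, function(th) {
  pred <- as.numeric(prob >= th)
  c(fpr = sum(pred == 1 & obs == 0) / sum(obs == 0),
    tpr = sum(pred == 1 & obs == 1) / sum(obs == 1))
}))
# Integrate TPR over FPR with the trapezoidal rule
sum(diff(roc[, 'fpr']) *
      (head(roc[, 'tpr'], -1) + tail(roc[, 'tpr'], -1)) / 2)
```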
Here is the Receiver Operator Curve for our species:
# Receiver Operator Curve (ROC)
plotROC(rf_model, test = test)
TSS and AUC typically vary from 0-1, and values closer to 1 indicate better performing models. TSS values >0.8, 0.6-0.8 and 0.2-0.6 indicate good to excellent, useful, and poor model performance, respectively. AUC results are considered excellent for values between 0.9-1, good for 0.8-0.9, fair for 0.7-0.8, poor for 0.6-0.7 and failed for 0.5-0.6. This is how you calculate them using SDMtune:
# True Skill Statistic
tss(rf_model, test = test)
## [1] 0.6052632
# Area Under Curve
auc(rf_model, test = test)
## [1] 0.8379501
Rather than just using a majority-rule threshold, there are many different thresholds we could apply, depending on what we want from our model. For example, if we wanted a model that rarely misses a True Positive, we would set our threshold very low (e.g. 0.1). If we wanted a model that rarely includes a False Positive, we could set our threshold very high (e.g. 0.9). There are tools available to help us maximise our True Positive Rate while minimising our False Positive Rate. This is always a trade-off, so it takes some careful detective work and a personal decision about what kind of output you want in order to settle on an adequate threshold.
In our case, let’s select the value that maximises our True Positive Rate, while minimising our False Positive Rate.
# Selecting a threshold
ths <- thresholds(rf_model, test = test)
ths %>% DT::datatable()
Let’s select the value for maximum test sensitivity (True Positive Rate) plus specificity (True Negative Rate) in the 5th row and 2nd column, which is 0.382.
th = ths[5,2]
cm_custom <- confMatrix(rf_model, test = test, th = th)
We can now calculate our chosen True Positive Rate and False Positive Rate and add it to our ROC plot:
# True Positive Rate (Sensitivity)
TPR <- cm_custom$tp/(cm_custom$tp + cm_custom$fn)
# False Positive Rate (1 - Specificity)
FPR <- 1 - cm_custom$tn/(cm_custom$tn + cm_custom$fp)
# Receiver Operator Curve (ROC) showing best threshold
plotROC(rf_model, test = test) +
geom_point(aes(x = FPR, y = TPR), col = 'red', size = 3) +
labs(x = 'False Positive Rate (1 - Specificity)',
y = 'True Positive Rate (Sensitivity)')
Again, there is a whole world of possible metrics you could use. Read up more about the use of different metrics:
Step 8.
To determine which environmental covariates had the greatest influence on the modeled distribution of our species, we can calculate the variable importance. There are many ways to do this, and different models have different approaches (e.g. jackknife with MaxEnt). This function randomly permutes one variable at a time and calculates the decrease in training AUC. The results are normalised to percentages.
vi <- varImp(rf_model, permut = 5)
plotVarImp(vi[,1:2])
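Conceptually, the permutation procedure behind varImp() can be sketched like this (a rough approximation, not SDMtune's exact internals: shuffle one predictor in a copy of the data and measure how much the AUC drops):

```r
# Rough sketch of permutation importance for one variable (mean_ann_t)
baseline_auc <- auc(rf_model, test = train)
shuffled <- train
set.seed(42)
shuffled@data$mean_ann_t <- sample(shuffled@data$mean_ann_t)
shuffled_auc <- auc(rf_model, test = shuffled)
baseline_auc - shuffled_auc  # larger drop = more important variable
```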
These results suggest that mean_diurnal_t_range (52.2 %) and p_seas (18.2 %) are the most important variables in our model. Remember to consider the collinearity relationships with the variables we excluded earlier.
We can also plot the non-linear response of our species to each environmental variable when all other variables are set to their mean values. This gives us an idea of the direction in which a variable may influence the presence/absence of our species, while the variable importance lets us know about the magnitude. Let’s take a look at our top 2 variables:
SDMtune::plotResponse(rf_model, var = vi$Variable[1], marginal = TRUE, rug = TRUE)
SDMtune::plotResponse(rf_model, var = vi$Variable[2], marginal = TRUE, rug = TRUE)
Our final and perhaps most exciting step is to take a look at the prediction surface. This output gives us an idea of where the environment best correlates with the known locations (and randomised pseudo-absences) of our species. For this reason, we can call it a prediction of habitat suitability. For many biogeographical reasons, it is unlikely our species will occupy all of these spaces, but it gives us an idea of where conditions could potentially be suitable for it to grow:
pred <- predict(rf_model, data = raster::stack(wc_mad_sel))
#### Export data
terra::writeRaster(rast(pred), here(paste0('output/species_distribution_model/',spp_code,'_prediction.tif')), overwrite = TRUE)
Convert the predicted raster to a data.frame so that we can easily visualise it in ggplot2 using geom_tile():
pred_df <- as.data.frame(pred, xy = TRUE)
ggplot() +
geom_tile(data = pred_df, aes(x = x, y = y, fill = layer)) +
scale_fill_gradientn(colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"), na.value = NA,
name = 'Habitat\nsuitability') +
geom_sf(data = mad_sf, fill = NA) +
geom_point(data = all_pts_df %>% filter(pr_ab == 1), aes(x = x, y = y)) +
labs(x = 'Longitude', y = 'Latitude') +
theme_minimal()
In certain circumstances, we may be more interested in predicting a binary outcome of sites where the species is or is not likely to occur. We can do this by applying the threshold that we tinkered with earlier for where habitat is likely to be suitable.
ths <- thresholds(rf_model, test = test)
ths %>% DT::datatable()
Let’s again select the threshold value that maximizes the sensitivity plus specificity:
th = ths[5,2]
# th = 0.5
print(paste('The chosen threshold value is:', th))
## [1] "The chosen threshold value is: 0.382"
We can now use this threshold value to alter our spatial prediction. Compute the thresholded raster and then convert it to a data.frame.
pred_th <- pred >= th
pred_th_df <- as.data.frame(pred_th, xy = TRUE)
plot(pred_th)
Create the final threshold plot:
ggplot() +
geom_tile(data = pred_th_df, aes(x = x, y = y, fill = as.factor(layer))) +
scale_fill_manual(values = c("#2c7bb6", "#d7191c"), na.value = NA,
name = paste0('Habitat\nsuitability\nbinary (>= ',th,')'),
labels = c('absent','present')) +
geom_sf(data = mad_sf, fill = NA) +
geom_point(data = all_pts_df %>% filter(pr_ab == 1), aes(x = x, y = y)) +
labs(x = 'Longitude', y = 'Latitude') +
theme_minimal()
Once we have identified the core regions of habitat suitability for our species (i.e. our thresholded presence-absence map), we can then use the output for several downstream analyses. In this example, we will calculate the proportion of our species’ core suitable habitat that overlaps with currently designated protected areas in Madagascar.
The protected area data was downloaded from the World Database on Protected Areas as a set of 3 shapefiles for Madagascar. These spatial files contain lots of useful information on the type, size and purpose of the protected area.
Let’s first load this dataset in and filter it to only include terrestrial protected areas:
#### Identify overlap with Protected Areas ----
# load in protected area shapefiles. There are 3 files, so we want to load them all in together and then bind them into one file
prot_areas <- list.files(here('data/vect/WDPA_Madagascar'), pattern = '*.shp', full.names = TRUE)
prot_areas_list <- lapply(prot_areas, read_sf)
# bind the 3 files together
prot_areas_all <- bind_rows(prot_areas_list) %>% filter(MARINE == 0)
As we’re doing area calculations, it is really important that we now convert all of our spatial data files to an equal-area projection. This means that distance and area will be calculated more accurately than if we were using a longitude/latitude system.
#### convert to equal area projection
# convert the presence/absence raster
pred_th %>%
projectRaster(crs = 'EPSG:29702', method = 'ngb') -> pred_th_proj
# convert the protected areas
prot_areas_all %>%
st_transform(crs = 'EPSG:29702') -> prot_areas_all_proj
Let’s plot both versions side by side and see if you can tell the difference between the projections:
par(mfrow=c(1,2))
plot(rast(pred_th))
## Warning: [set.cats] setting categories like this is deprecated; use a two-column
## data.frame instead
plot(vect(prot_areas_all), add = TRUE)
plot(rast(pred_th_proj))
plot(vect(prot_areas_all_proj), add = TRUE)
We can now do some simple calculations. As we know the resolution of each grid cell, all we need to do is count how many grid cells are presences (1’s) and how many cells there are in total (0’s + 1’s), then multiply by the cell size to get a basic area estimate.
# What is the area of species presences?
# we select and sum only the cells with 1's, then multiply this by the size of the raster cells and lastly divide by 1000^2 to convert m2 to km2
pres_area <- (sum(pred_th_proj[] == 1, na.rm = TRUE) * (res(pred_th_proj)[1]*res(pred_th_proj)[2]) / (1000^2))
paste('The area of species presences is',pres_area, 'km2')
## [1] "The area of species presences is 259045.12 km2"
# Calculate the area of all cells
all_area <- (sum(!is.na(pred_th_proj[])) * (res(pred_th_proj)[1]*res(pred_th_proj)[2]) / (1000^2))
paste('The area of all cells is',all_area, 'km2')
## [1] "The area of all cells is 614827.402 km2"
# And lastly calculate the percentage of coverage of our species across all of Madagascar
paste('The species presences cover',round(pres_area/all_area*100, 2), '% of Madagascar')
## [1] "The species presences cover 42.13 % of Madagascar"
We now want to work out what % of our species is found within Protected Areas:
# create custom function to calculate the proportion of area covered by each Protected Area
sum_cover <- function(x){
list(x %>%
group_by(value) %>%
summarize(total_area = sum(coverage_area)))
}
# extract the amount of area covered
extract_all <- exact_extract(pred_th_proj, prot_areas_all_proj, coverage_area = TRUE, summarize_df = TRUE, fun = sum_cover, progress = FALSE)
# add the names of the protected areas back on to our extraction
names(extract_all) <- prot_areas_all_proj$ORIG_NAME
# convert the list to a data frame
extract_df <- bind_rows(extract_all, .id = 'ORIG_NAME')
# take a look at the first 6 rows
head(extract_df)
## # A tibble: 6 × 3
## ORIG_NAME value total_area
## <chr> <dbl> <dbl>
## 1 Ankarafantsika 0 1695222369.
## 2 Andohahela 1 872005790.
## 3 Marojejy 1 752626637.
## 4 Tsaratanana 0 373949923.
## 5 Tsaratanana 1 1116599716.
## 6 Tsimanampesotse 0 2571107742.
We can now sum all of the area that overlaps with the protected areas for presences (i.e. 1’s) and divide this by the total area of all presences:
# we can now sum all of the area that overlaps with the protected areas for presences (i.e. 1's) and divide this by the total area of all presences
area_under_pas <- extract_df %>%
filter(value == 1) %>%
summarise(sum(total_area)/(1000^2))
paste(round(area_under_pas/pres_area * 100, 2),'% of the predicted presences are found within protected areas')
## [1] "23.59 % of the predicted presences are found within protected areas"
Our final step is to join our IUCN protected area categories onto our presence area data.frame. This will provide us with some information on what percentage of our species area is conserved under different categories. This provides important context on both the quality and quantity of protected areas overlapping with our species range:
iucn_cat <- prot_areas_all_proj %>%
st_drop_geometry() %>%
dplyr::select(ORIG_NAME, IUCN_CAT)
extract_df %>%
left_join(iucn_cat, by = 'ORIG_NAME') %>%
filter(value == 1) %>%
group_by(IUCN_CAT) %>%
summarise(area = sum(total_area)/(1000^2)) %>%
mutate(perc = round(area/sum(area) * 100, 2))
## # A tibble: 7 × 3
## IUCN_CAT area perc
## <chr> <dbl> <dbl>
## 1 Ia 1158. 1.9
## 2 II 16396. 26.8
## 3 IV 4834. 7.91
## 4 Not Applicable 4607. 7.54
## 5 Not Reported 22367. 36.6
## 6 V 8286. 13.6
## 7 VI 3453. 5.65
END